202 research outputs found

    Efficient and accurate P-value computation for Position Weight Matrices

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time.</p> <p>Results</p> <p>The contribution of this paper is two fold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software called TFM-PVALUE, that is freely available.</p> <p>Conclusion</p> <p>We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.</p

    Multivariate Analysis and Visualization of Splicing Correlations in Single-Gene Transcriptomes

    Get PDF
    BACKGROUND: RNA metabolism, through 'combinatorial splicing', can generate enormous structural diversity in the proteome. Alternative domains may interact, however, with unpredictable phenotypic consequences, necessitating integrated RNA-level regulation of molecular composition. Splicing correlations within transcripts of single genes provide valuable clues to functional relationships among molecular domains as well as genomic targets for higher-order splicing regulation. RESULTS: We present tools to visualize complex splicing patterns in full-length cDNA libraries. Developmental changes in pair-wise correlations are presented vectorially in 'clock plots' and linkage grids. Higher-order correlations are assessed statistically through Monte Carlo analysis of a log-linear model with an empirical-Bayes estimate of the true probabilities of observed and unobserved splice forms. Log-linear coefficients are visualized in a 'spliceprint,' a signature of splice correlations in the transcriptome. We present two novel metrics: the linkage change index, which measures the directional change in pair-wise correlation with tissue differentiation, and the accuracy index, a very simple goodness-of-fit metric that is more sensitive than the integrated squared error when applied to sparsely populated tables, and unlike chi-square, does not diverge at low variance. Considerable attention is given to sparse contingency tables, which are inherent to single-gene libraries. CONCLUSION: Patterns of splicing correlations are revealed, which span a broad range of interaction order and change in development. The methods have a broad scope of applicability, beyond the single gene – including, for example, multiple gene interactions in the complete transcriptome

    A comprehensive RNA-Seq-based gene expression atlas of the summer squash (Cucurbita pepo) provides insights into fruit morphology and ripening mechanisms

    Get PDF
    Background: Summer squash (Cucurbita pepo: Cucurbitaceae) are a popular horticultural crop for which there is insufficient genomic and transcriptomic information. Gene expression atlases are crucial for the identification of genes expressed in different tissues at various plant developmental stages. Here, we present the first comprehensive gene expression atlas for a summer squash cultivar, including transcripts obtained from seeds, shoots, leaf stem, young and developed leaves, male and female flowers, fruits of seven developmental stages, as well as primary and lateral roots. Results: In total, 27,868 genes and 2352 novel transcripts were annotated from these 16 tissues, with over 18,000 genes common to all tissue groups. Of these, 3812 were identified as housekeeping genes, half of which assigned to known gene ontologies. Flowers, seeds, and young fruits had the largest number of specific genes, whilst intermediate-age fruits the fewest. There also were genes that were differentially expressed in the various tissues, the male flower being the tissue with the most differentially expressed genes in pair-wise comparisons with the remaining tissues, and the leaf stem the least. The largest expression change during fruit development was early on, from female flower to fruit two days after pollination. A weighted correlation network analysis performed on the global gene expression dataset assigned 25,413 genes to 24 coexpression groups, and some of these groups exhibited strong tissue specificity. Conclusions: These findings enrich our understanding about the transcriptomic events associated with summer squash development and ripening. This comprehensive gene expression atlas is expected not only to provide a global view of gene expression patterns in all major tissues in C. pepo but to also serve as a valuable resource for functional genomics and gene discovery in Cucurbitaceae

    Statistical significance of cis-regulatory modules

    Get PDF
    BACKGROUND: It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning. RESULTS: We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate p-values for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cis-regulatory module sites. CONCLUSION: The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software

    The quest for the solar g modes

    Full text link
    Solar gravity modes (or g modes) -- oscillations of the solar interior for which buoyancy acts as the restoring force -- have the potential to provide unprecedented inference on the structure and dynamics of the solar core, inference that is not possible with the well observed acoustic modes (or p modes). The high amplitude of the g-mode eigenfunctions in the core and the evanesence of the modes in the convection zone make the modes particularly sensitive to the physical and dynamical conditions in the core. Owing to the existence of the convection zone, the g modes have very low amplitudes at photospheric levels, which makes the modes extremely hard to detect. In this paper, we review the current state of play regarding attempts to detect g modes. We review the theory of g modes, including theoretical estimation of the g-mode frequencies, amplitudes and damping rates. Then we go on to discuss the techniques that have been used to try to detect g modes. We review results in the literature, and finish by looking to the future, and the potential advances that can be made -- from both data and data-analysis perspectives -- to give unambiguous detections of individual g modes. The review ends by concluding that, at the time of writing, there is indeed a consensus amongst the authors that there is currently no undisputed detection of solar g modes.Comment: 71 pages, 18 figures, accepted by Astronomy and Astrophysics Revie

    Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set

    Get PDF
    There is an enormous amount of information encoded in each genome – enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features are varied in their data types and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively exploring sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependant information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and successful in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands

    Assessment of clusters of transcription factor binding sites in relationship to human promoter, CpG islands and gene expression

    Get PDF
    BACKGROUND: Gene expression is regulated mainly by transcription factors (TFs) that interact with regulatory cis-elements on DNA sequences. To identify functional regulatory elements, computer searching can predict TF binding sites (TFBS) using position weight matrices (PWMs) that represent positional base frequencies of collected experimentally determined TFBS. A disadvantage of this approach is the large output of results for genomic DNA. One strategy to identify genuine TFBS is to utilize local concentrations of predicted TFBS. It is unclear whether there is a general tendency for TFBS to cluster at promoter regions, although this is the case for certain TFBS. Also unclear is the identification of TFs that have TFBS concentrated in promoters and to what level this occurs. This study hopes to answer some of these questions. RESULTS: We developed the cluster score measure to evaluate the correlation between predicted TFBS clusters and promoter sequences for each PWM. Non-promoter sequences were used as a control. Using the cluster score, we identified a PWM group called PWM-PCP, in which TFBS clusters positively correlate with promoters, and another PWM group called PWM-NCP, in which TFBS clusters negatively correlate with promoters. The PWM-PCP group comprises 47% of the 199 vertebrate PWMs, while the PWM-NCP group occupied 11 percent. After reducing the effect of CpG islands (CGI) against the clusters using partial correlation coefficients among three properties (promoter, CGI and predicted TFBS cluster), we identified two PWM groups including those strongly correlated with CGI and those not correlated with CGI. CONCLUSION: Not all PWMs predict TFBS correlated with human promoter sequences. Two main PWM groups were identified: (1) those that show TFBS clustered in promoters associated with CGI, and (2) those that show TFBS clustered in promoters independent of CGI. Assessment of PWM matches will allow more positive interpretation of TFBS in regulatory regions

    Lethal Mutants and Truncated Selection Together Solve a Paradox of the Origin of Life

    Get PDF
    BACKGROUND: Many attempts have been made to describe the origin of life, one of which is Eigen's cycle of autocatalytic reactions [Eigen M (1971) Naturwissenschaften 58, 465-523], in which primordial life molecules are replicated with limited accuracy through autocatalytic reactions. For successful evolution, the information carrier (either RNA or DNA or their precursor) must be transmitted to the next generation with a minimal number of misprints. In Eigen's theory, the maximum chain length that could be maintained is restricted to 100-1000 nucleotides, while for the most primitive genome the length is around 7000-20,000. This is the famous error catastrophe paradox. How to solve this puzzle is an interesting and important problem in the theory of the origin of life. METHODOLOGY/PRINCIPAL FINDINGS: We use methods of statistical physics to solve this paradox by carefully analyzing the implications of neutral and lethal mutants, and truncated selection (i.e., when fitness is zero after a certain Hamming distance from the master sequence) for the critical chain length. While neutral mutants play an important role in evolution, they do not provide a solution to the paradox. We have found that lethal mutants and truncated selection together can solve the error catastrophe paradox. There is a principal difference between prebiotic molecule self-replication and proto-cell self-replication stages in the origin of life. CONCLUSIONS/SIGNIFICANCE: We have applied methods of statistical physics to make an important breakthrough in the molecular theory of the origin of life. Our results will inspire further studies on the molecular theory of the origin of life and biological evolution

    microPIR: An Integrated Database of MicroRNA Target Sites within Human Promoter Sequences

    Get PDF
    Background: microRNAs are generally understood to regulate gene expression through binding to target sequences within 39-UTRs of mRNAs. Therefore, computational prediction of target sites is usually restricted to these gene regions. Recent experimental studies though have suggested that microRNAs may alternatively modulate gene expression by interacting with promoters. A database of potential microRNA target sites in promoters would stimulate research in this field leading to more understanding of complex microRNA regulatory mechanism. Methodology: We developed a database hosting predicted microRNA target sites located within human promoter sequences and their associated genomic features, called microPIR (microRNA-Promoter Interaction Resource). microRNA seed sequences were used to identify perfect complementary matching sequences in the human promoters and the potential target sites were predicted using the RNAhybrid program..15 million target sites were identified which are located within 5000 bp upstream of all human genes, on both sense and antisense strands. The experimentally confirmed argonaute (AGO) binding sites and EST expression data including the sequence conservation across vertebrate species of each predicted target are presented for researchers to appraise the quality of predicted target sites. The microPIR database integrates various annotated genomic sequence databases, e.g. repetitive elements, transcription factor binding sites, CpG islands, and SNPs, offering users the facility to extensively explore relationships among target sites and other genomi

    Defining Life: The Virus Viewpoint

    Get PDF
    Are viruses alive? Until very recently, answering this question was often negative and viruses were not considered in discussions on the origin and definition of life. This situation is rapidly changing, following several discoveries that have modified our vision of viruses. It has been recognized that viruses have played (and still play) a major innovative role in the evolution of cellular organisms. New definitions of viruses have been proposed and their position in the universal tree of life is actively discussed. Viruses are no more confused with their virions, but can be viewed as complex living entities that transform the infected cell into a novel organism—the virus—producing virions. I suggest here to define life (an historical process) as the mode of existence of ribosome encoding organisms (cells) and capsid encoding organisms (viruses) and their ancestors. I propose to define an organism as an ensemble of integrated organs (molecular or cellular) producing individuals evolving through natural selection. The origin of life on our planet would correspond to the establishment of the first organism corresponding to this definition
    • …
    corecore